Fast and accurate semi-supervised protein homology detection with large uncurated sequence databases
Abstract
Establishing structural and functional relationships between sequences when only primary sequence information is available is a key task in biological sequence analysis. This ability is critical for tasks such as inferring the superfamily membership of unannotated proteins (remote homology detection) when no secondary or tertiary structure is available. Recent methods such as profile kernels and mismatch neighborhood kernels have shown promising results by leveraging unlabeled data and explicitly modeling mutations, insertions, and deletions using mutational neighborhoods. However, the size of such neighborhoods depends exponentially on the cardinality of the alphabet set, which makes kernel evaluation expensive and hinders the use of these powerful tools. Moreover, previous studies on large-scale semi-supervised protein homology detection lack a systematic and biologically motivated approach for leveraging the unlabeled data set. In this study, we propose such an approach for extracting relevant information from unlabeled sequence databases. We also propose a method to remove the bias caused by over-represented sequences, which are common in unlabeled sequence databases. Combining these approaches with a class of kernels (sparse spatial sampling kernels, SSSK) that effectively model mutation, insertion, and deletion, we achieve fast and accurate semi-supervised protein homology detection on three large unlabeled databases. The resulting classifiers significantly outperform previously published state-of-the-art methods in accuracy and show order-of-magnitude improvements in experimental running time.
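To make the cost argument about mutational neighborhoods concrete, here is a minimal sketch. It assumes the neighborhood of a k-mer is the set of all k-mers within Hamming distance m of it, which is the usual mismatch-kernel definition; the abstract itself does not spell this out. The function name and the values of k and m are illustrative and are not taken from the paper.

```python
from math import comb


def mismatch_neighborhood_size(k: int, m: int, alphabet_size: int) -> int:
    """Count the k-mers within Hamming distance m of a fixed k-mer.

    Assumption: substitutions only (no insertions or deletions).
    Choose i of the k positions to change, then pick one of the
    (alphabet_size - 1) other symbols at each changed position.
    """
    return sum(comb(k, i) * (alphabet_size - 1) ** i for i in range(m + 1))


if __name__ == "__main__":
    # Compare a 4-letter nucleotide alphabet with the 20-letter amino acid alphabet.
    for name, size in (("DNA", 4), ("protein", 20)):
        print(name, mismatch_neighborhood_size(k=5, m=2, alphabet_size=size))
```

Under these assumptions the count grows as a power of the alphabet size, so the 20-letter protein alphabet yields far larger neighborhoods than a nucleotide alphabet for the same k and m.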
Similar resources
A fast, large-scale learning method for protein sequence classification
Motivation: Establishing structural and functional relationships between sequences in the presence of only the primary sequence information is a key task in biological sequence analysis. This ability can be critical for tasks such as making inferences of the structural class of unannotated proteins when no secondary or tertiary structure is available. Recent computational methods based on profi...
Expression analyses of endoglucanase gene in Penicillium oxalicum and Trichoderma viride
The expression of the endoglucanase gene and the protein profiles of two fungal species, Penicillium oxalicum 1SMS and Trichoderma viride 156MS, both with high cellulase enzyme activity, were investigated. Fungal isolates were cultured on inducer CMC medium, and the amounts of released sugar and protein were then assayed every three days for a month using arsenate molybdate reagent and the Bradford method,...
Protein remote homology detection based on auto-cross covariance transformation
Protein remote homology detection is a critical step toward annotating a protein's structure and function. Supervised learning algorithms such as support vector machines are currently the most accurate methods. The position-specific score matrices (PSSMs) contain a wealth of information about the evolutionary relationships of proteins. However, the PSSMs often have different lengths, which are difficult to b...
SVM-HUSTLE - an iterative semi-supervised machine learning approach for pairwise protein remote homology detection
Motivation: As the amount of biological sequence data continues to grow exponentially, we face the increasing challenge of assigning function to this enormous molecular 'parts list'. The most popular approaches to this challenge make use of the simplifying assumption that similar functional molecules, or proteins, sometimes have similar composition, or sequence. However, these algorithms often fa...
Spatially-constrained sample kernel for sequence classification
Kernel-based learning methods provide some of the most accurate results in many sequence analysis and prediction tasks [1, 2, 4, 6]. However, the improved accuracy is often achieved at the cost of high computational complexity of training and prediction. We propose a new family of the string-based kernel classification methods for the sequence analysis tasks that offer low computational cost an...
Publication date: 2008